Module 2: Why Visualize Data?#
Visualizations help us understand data quicker and in a more effective way, resulting in the ability to extract insights and deeper understanding of the data we’re working with.
If we were just given plain data, it may be difficult to extract patterns from it. However, by transforming this data into visual formats such as charts, graphs, and maps, we can more easily identify trends, correlations, and outliers that might not be apparent in raw data.
For example, time series data can be better understood through line graphs, which can reveal trends over time. Similarly, scatter plots can highlight relationships between variables, and heat maps can provide a clear view of data density or intensity. Visualizations also aid in communicating complex data findings to a broader audience, including those who may not have a technical background.
Interactive visualizations allow users to explore data dynamically, offering the ability to drill down into details and adjust parameters to see how outcomes change. This interactive element enhances the depth of analysis and supports better decision-making.
Without visualizations, the conclusions we reach my be vague and equivocal. For example, summary statistics are indeed helpful, but ultimately are only a partial piece of a bigger story. Moreover, they can be misleading if we haven’t actually plotted out the data yet.
Through the power of visualizations, we can immediately identify patterns within data, which we otherwise wouldn’t be able to from merely looking at raw data alone, or summary statistics too for that matter.
Clinical Questions#
Within this module, we aim to reason about the following questions:
Does whether a not a patient used RHC usage affect their length of stay in the hospital?
Does sex play a role in determining if a patient undergoes RHC usage, and are there any discrepancies between the sex of the patient and their length of stay in the hosptial?
Is there any association between the age of the patient and their length of stay in the hospital?
Imperative vs Declarative Visualizations#
When it comes to data visualization, there are 2 main approaches:
Imperative
Declarative
Imperative (low level)#
Imperative visualizations center around the exact construction details of plots, emphasizing the mechanics behind their creation. This approach relies heavily on the use of loops and low-level drawing commands, allowing for precise control over every aspect of the plot’s appearance.
While this method provides detailed customization and pinpoint control, it can become cumbersome and time-consuming when attempting to create more complex visualizations. Despite the possibility for detailed fine-tuning, this approach may not always be the most efficient choice for data representations.
Declarative (high level)#
On the other hand, declarative visualizations prioritize the discovery of significant relationships within data, rather than the mechanics of plot construction.
This approach involves linking columns of data to visual channels, such as axes, colors, or shapes, to effectively convey the underlying patterns and insights.
By focusing on the essence of the data, declarative visualizations provide the core information needed, without the hassle of having to manage minor plotting details. This method streamlines the creation process, allowing users to concentrate on interpreting the data rather than on the specifics of plot construction.
What is EDA (Explatory Data Analysis)?#
Exploratory Data Analysis (EDA) is a crucial step in the data analysis workflow where analysts use statistical graphs and other data visualization tools to summarize the main characteristics of a dataset. Simply put, EDA is the process of analyzing and visualizing data to uncover patterns, spot anomalies, and gain insights before actually any models are created.
It involves investigating datasets to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
EDA helps in understanding the underlying structure of the dataset in question, and guides subsequent data processing and modeling. By revealing insights and relationships that may not be immediately obvious, EDA plays an important role in making decisions that are primarily data-driven.
Visualization Grammar#
In order to create informative and effective plots, we must incorporate a high-level grammar of visualizations.
Visualization grammar is a systematic approach to building visual representations of data. By breaking down the process into fundamental components, it allows for more flexible, reusable, and understandable visualizations. This concept is similar to how grammar in a language allows for the construction of meaningful sentences from a set of rules.
Visualization grammar is composed of 3 different grammatical components.
Creating a chart linked to a dataframe. The foundation of any visualization is a data structure, typically a dataframe, that contains the data to be visualized. The first step is to establish this link, ensuring the data is correctly loaded and accessible for plotting.
Adding graphical elements (points, lines, etc). Graphical elements, or marks, are the basic visual components used in a chart, such as points, lines, bars, and areas. These elements represent the data visually and can be combined in various ways to form different types of charts (e.g., scatter plots, line charts, bar charts).
Encoding dataframe columns as visual channels. Visual channels refer to the attributes of graphical elements that can be manipulated to represent data, such as position, size, shape, color, and orientation. Encoding involves mapping the columns of a dataframe to these visual channels, transforming raw data into a visual format that can be easily interpreted.
We will see later how we can incorporate visualization grammar with the Altair package to create visualizations, and thus perform exploratory data analysis.
The Dataset#
Many cardiologists believe that right heart catheterization (through the use of a monitoring device to measure cardiac function) is beneficial in helping manage patients in the ICU.
This RHC dataset was used in a significant study by Connors et al. (1996) published in the Journal of the American Medical Association. The study objective was to evaluate the effectiveness of RHC, also known as the Swan-Ganz catheter, in the initial care of critically ill patients. The data pertains to day 1 of hospitalization, focusing on whether or not a patient received RHC on the first day they qualified for the SUPPORT (The Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments).
Specifically, we aim to investigate potential associations between the use of Right Heart Catheterization (RHC) within the first 24 hours in the ICU and various health-related outcomes, such as length of stay.
Let’s first load in our data from the link https://hbiostat.org/data/repo/rhc.csv.
import pandas as pd
import numpy as np
ObsData = pd.read_csv("https://hbiostat.org/data/repo/rhc.csv").iloc[:,1:]
ObsData.head()
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[1], line 1
----> 1 import pandas as pd
2 import numpy as np
4 ObsData = pd.read_csv("https://hbiostat.org/data/repo/rhc.csv").iloc[:,1:]
ModuleNotFoundError: No module named 'pandas'
Creating the Analytic Data#
The outcome Y that we’re currently interested in, is the patient’s length of stay in the hospital.
To compute such a value, let’s take the date of discharge (date patient leaves hospital), and subtract it by the study admission date (date patient enters hospital).
# Creating length of stay
ObsData['Y'] = ObsData['dschdte'] - ObsData['sadmdte']
If the date of discharge is not available, we use date of death instead, subtracting it by study admission date.
missing_Y = ObsData['Y'].isna()
# Replace the missing values in 'Y' with the difference between 'dthdte' and 'sadmdte'
ObsData.loc[missing_Y, 'Y'] = ObsData['dthdte'][missing_Y] - ObsData['sadmdte'][missing_Y]
Now, let’s remove the outcomes we are not interested in for this example.
columns_to_drop = ['dthdte', 'lstctdte', 'dschdte', 'death', 't3d30', 'dth30', 'surv2md1']
ObsData = ObsData.drop(columns=columns_to_drop)
ObsData.head()
| cat1 | cat2 | ca | sadmdte | cardiohx | chfhx | dementhx | psychhx | chrpulhx | renalhx | ... | hema | seps | trauma | ortho | adld3p | urin1 | race | income | ptid | Y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | COPD | NaN | Yes | 11142 | 0 | 0 | 0 | 0 | 1 | 0 | ... | No | No | No | No | 0.0 | NaN | white | Under $11k | 5 | 9.0 |
| 1 | MOSF w/Sepsis | NaN | No | 11799 | 1 | 1 | 0 | 0 | 0 | 0 | ... | No | Yes | No | No | NaN | 1437.0 | white | Under $11k | 7 | 45.0 |
| 2 | MOSF w/Malignancy | MOSF w/Sepsis | Yes | 12083 | 0 | 0 | 0 | 0 | 0 | 0 | ... | No | No | No | No | NaN | 599.0 | white | $25-$50k | 9 | 60.0 |
| 3 | ARF | NaN | No | 11146 | 0 | 0 | 0 | 0 | 0 | 0 | ... | No | No | No | No | NaN | NaN | white | $11-$25k | 10 | 37.0 |
| 4 | MOSF w/Sepsis | NaN | No | 12035 | 0 | 0 | 0 | 0 | 0 | 0 | ... | No | No | No | No | NaN | 64.0 | white | Under $11k | 11 | 2.0 |
5 rows × 56 columns
Next, let’s remove variables deemed unnecessary and/or problematic.
columns_to_drop = ['sadmdte', 'ptid', 'adld3p', 'urin1', 'cat2']
ObsData = ObsData.drop(columns=columns_to_drop)
ObsData.head()
| cat1 | ca | cardiohx | chfhx | dementhx | psychhx | chrpulhx | renalhx | liverhx | gibledhx | ... | gastr | renal | meta | hema | seps | trauma | ortho | race | income | Y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | COPD | Yes | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | No | No | No | No | No | No | No | white | Under $11k | 9.0 |
| 1 | MOSF w/Sepsis | No | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | No | No | No | No | Yes | No | No | white | Under $11k | 45.0 |
| 2 | MOSF w/Malignancy | Yes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | No | No | No | No | No | No | No | white | $25-$50k | 60.0 |
| 3 | ARF | No | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | No | No | No | No | No | No | No | white | $11-$25k | 37.0 |
| 4 | MOSF w/Sepsis | No | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | No | No | No | No | No | No | No | white | Under $11k | 2.0 |
5 rows × 51 columns
The next step is to convert all categorical variables to factors.
# List of columns to convert to categorical
factors = ["cat1", "ca", "cardiohx", "chfhx", "dementhx", "psychhx",
"chrpulhx", "renalhx", "liverhx", "gibledhx", "malighx",
"immunhx", "transhx", "amihx", "sex", "dnr1", "ninsclas",
"resp", "card", "neuro", "gastr", "renal", "meta", "hema",
"seps", "trauma", "ortho", "race", "income"]
# Convert specified columns to category dtype
ObsData[factors] = ObsData[factors].astype('category')
Next, let’s convert our treament variable “swang1” to a binary variable, A. As a reminder, “swang1” indicates whether or not a patient recieved a RHC on the first day.
ObsData['A'] = np.where(ObsData['swang1'] == 'RHC', 1, 0)
Now that binary treatment A has been created, we can drop the swang1 column, as we no longer need it.
# Drop the 'swang1' column
ObsData = ObsData.drop(columns=['swang1'])
Now, let’s categorize the variables so that they match with the original paper.
# Set the order of categories for 'race'
ObsData['race'] = pd.Categorical(ObsData['race'], categories=['white', 'black', 'other'], ordered=False)
# Convert 'sex' to a categorical type and set reference level
ObsData['sex'] = pd.Categorical(ObsData['sex'], categories=['Male', 'Female'], ordered=False)
# Convert 'cat1' to a categorical type and set new levels
ObsData['cat1'] = ObsData['cat1'].replace({
"Cirrhosis": "Other", "Colon Cancer": "Other", "Coma": "Other",
"COPD": "Other", "Lung Cancer": "Other", "MOSF w/Malignancy": "MOSF", "MOSF w/Sepsis": "MOSF"
}).astype('category')
# Convert 'ca' to a categorical type and set new levels
ObsData['ca'] = ObsData['ca'].replace({"No": "None", "Yes": "Localized (Yes)"}).astype('category')
C:\Users\TESTUSER\AppData\Local\Temp\ipykernel_9532\3298340898.py:8: FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.
ObsData['cat1'] = ObsData['cat1'].replace({
C:\Users\TESTUSER\AppData\Local\Temp\ipykernel_9532\3298340898.py:14: FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.
ObsData['ca'] = ObsData['ca'].replace({"No": "None", "Yes": "Localized (Yes)"}).astype('category')
Lastly, let’s rename the variables so that they’re easier to work with and understand.
# Renaming the columns
ObsData.columns = ["Disease.category", "Cancer", "Cardiovascular",
"Congestive.HF", "Dementia", "Psychiatric", "Pulmonary",
"Renal", "Hepatic", "GI.Bleed", "Tumor",
"Immunosuppression", "Transfer.hx", "MI", "age", "sex",
"edu", "DASIndex", "APACHE.score", "Glasgow.Coma.Score",
"blood.pressure", "WBC", "Heart.rate", "Respiratory.rate",
"Temperature", "PaO2vs.FIO2", "Albumin", "Hematocrit",
"Bilirubin", "Creatinine", "Sodium", "Potassium", "PaCo2",
"PH", "Weight", "DNR.status", "Medical.insurance",
"Respiratory.Diag", "Cardiovascular.Diag",
"Neurological.Diag", "Gastrointestinal.Diag", "Renal.Diag",
"Metabolic.Diag", "Hematologic.Diag", "Sepsis.Diag",
"Trauma.Diag", "Orthopedic.Diag", "race", "income",
"Y", "A"]
Recall that Y is length of stay (outcome), and A is RHC use (treatment)
Introduction to Altair#
We will be using the Python package “Altair” to create our visualizations,designed for simplicity and high-level plotting. It can be used to create a wide range of interactive and visually appealing charts with minimal code, and rather specifically focus on the relationships within the data.
Altair is particularly useful because it allows users to generate complex visualizations without worrying about low-level plotting details, making data analysis more accessible and efficient. Its integration with visualization grammars ensures that users can create sophisticated and interactive visualizations effortlessly.
To start working with Altair, let’s first load in the package.
pip install altair
Collecting altair
Downloading altair-5.4.0-py3-none-any.whl (671 kB)
------------------------------------- 671.7/671.7 kB 14.1 MB/s eta 0:00:00
Requirement already satisfied: jinja2 in c:\users\testuser\appdata\local\programs\python\python310\lib\site-packages (from altair) (3.1.4)
Requirement already satisfied: jsonschema>=3.0 in c:\users\testuser\appdata\local\programs\python\python310\lib\site-packages (from altair) (4.17.1)
Collecting narwhals>=1.1.0 (from altair)
Downloading narwhals-1.3.0-py3-none-any.whl (138 kB)
-------------------------------------- 138.3/138.3 kB 8.5 MB/s eta 0:00:00
Requirement already satisfied: packaging in c:\users\testuser\appdata\local\programs\python\python310\lib\site-packages (from altair) (23.2)
Collecting typing-extensions>=4.10.0 (from altair)
Downloading typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Requirement already satisfied: attrs>=17.4.0 in c:\users\testuser\appdata\local\programs\python\python310\lib\site-packages (from jsonschema>=3.0->altair) (23.2.0)
Requirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in c:\users\testuser\appdata\local\programs\python\python310\lib\site-packages (from jsonschema>=3.0->altair) (0.19.2)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\testuser\appdata\local\programs\python\python310\lib\site-packages (from jinja2->altair) (2.1.1)
Installing collected packages: typing-extensions, narwhals, altair
Attempting uninstall: typing-extensions
Found existing installation: typing_extensions 4.9.0
Uninstalling typing_extensions-4.9.0:
Successfully uninstalled typing_extensions-4.9.0
Successfully installed altair-5.4.0 narwhals-1.3.0 typing-extensions-4.12.2
Note: you may need to restart the kernel to use updated packages.
WARNING: Ignoring invalid distribution -cipy (c:\users\testuser\appdata\local\programs\python\python310\lib\site-packages)
WARNING: Ignoring invalid distribution -cipy (c:\users\testuser\appdata\local\programs\python\python310\lib\site-packages)
[notice] A new release of pip is available: 23.1.2 -> 24.2
[notice] To update, run: python.exe -m pip install --upgrade pip
import altair as alt
Now, let’s examine the dimensions of our dataset.
ObsData.shape
(5735, 51)
It should also be noted that Altair only allows a maximum of 5000 rows/observations when creating visualizations. This is an issue, as our dataset has 5735 rows.
To address this problem, we can disable the MaxRows checker:
alt.data_transformers.disable_max_rows()
DataTransformerRegistry.enable('default')
Now we are ready to proceed with our first visualization, where we aim to analyze the association between our treament (A) and outcome (Y) variables, which are RHC usage and length of stay, respectively.
Creating our First Visualization#
Let’s trace back to our understanding of visual grammars, and how that can be applied when creating plots in Altair.
For example, consider the following piece of Altair code:
alt.Chart(ObsData).mark_point().encode(x='A' , y='Y')
The above code block alt.Chart(ObsData).mark_point().encode(x='A', y='Y') can be described using the visualization grammar components as follows:
Creating a chart linked to a dataframe:
alt.Chart(ObsData)initializes a chart and links it to the dataframeObsData.
Adding graphical elements
.mark_point()specifies that the chart should use points to represent the data.
Encoding dataframe columns as visual channels:
.encode(x='A', y='Y')maps the dataframe column'A'to the x-axis and the column'Y'to the y-axis, defining how data is visually represented in the chart.
By incorporating the principles of visualization grammar, we can create clear and insightful visualizations. This structured approach enhances our ability to perform exploratory data analysis and communicate findings effectively.
The plot we have created compares how RHC is associated with the length of stay in the hospital. There does not seem to be any immediate indication that RHC has an effect on the length of stay. It appears that most patients typically stay less than 120 days.
Categorizing Visualizations by Colour#
What if we wanted to also map a categorical variable to the color scale?
We can utilize the ‘color’ argument of the .encode() function.
alt.Chart(ObsData).mark_point().encode(x='A' , y='Y', color='sex')